library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
── Attaching packages ── tidyverse 1.3.0 ──
✓ ggplot2 3.3.2     ✓ purrr   0.3.4
✓ tibble  3.0.1     ✓ dplyr   1.0.0
✓ tidyr   1.1.0     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.5.0
package ‘ggplot2’ was built under R version 3.6.2
package ‘tibble’ was built under R version 3.6.2
package ‘tidyr’ was built under R version 3.6.2
package ‘purrr’ was built under R version 3.6.2
package ‘dplyr’ was built under R version 3.6.2
── Conflicts ── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(dplyr)
1.1 Load the diamonds.csv data set and undertake an initial exploration of the data. You will find a description of the meanings of the variables on the relevant Kaggle page.
diamonds <- read_csv("diamonds.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  carat = col_double(),
  cut = col_character(),
  color = col_character(),
  clarity = col_character(),
  depth = col_double(),
  table = col_double(),
  price = col_double(),
  x = col_double(),
  y = col_double(),
  z = col_double()
)
diamonds
summary(diamonds)
X1              carat            cut                color              clarity            depth           table           price
Min.   :    1   Min.   :0.2000   Length:53940       Length:53940       Length:53940       Min.   :43.00   Min.   :43.00   Min.   :  326
1st Qu.:13486   1st Qu.:0.4000   Class :character   Class :character   Class :character   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950
Median :26970   Median :0.7000   Mode  :character   Mode  :character   Mode  :character   Median :61.80   Median :57.00   Median : 2401
Mean   :26970   Mean   :0.7979                                                            Mean   :61.75   Mean   :57.46   Mean   : 3933
3rd Qu.:40455   3rd Qu.:1.0400                                                            3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324
Max.   :53940   Max.   :5.0100                                                            Max.   :79.00   Max.   :95.00   Max.   :18823

x                y                z
Min.   : 0.000   Min.   : 0.000   Min.   : 0.000
1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910
Median : 5.700   Median : 5.710   Median : 3.530
Mean   : 5.731   Mean   : 5.735   Mean   : 3.539
3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040
Max.   :10.740   Max.   :58.900   Max.   :31.800
library(ggiraphExtra)
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
library(GGally)
1.2 We expect the carat of the diamonds to be strongly correlated with the physical dimensions x, y and z. Use ggpairs() to investigate correlations between these four variables.
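# Restrict the pairs plot to the four variables the question asks about
diamonds %>%
  select(carat, x, y, z) %>%
  ggpairs()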
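As a numeric complement to the pairs plot, the same relationships can be summarised with a correlation matrix. The sketch below uses a small made-up data frame (the values in `toy` are hypothetical, not taken from diamonds.csv) so that it runs on its own; with the real data, the same `cor()` call could be applied to the carat, x, y and z columns of `diamonds`.

```r
# Toy stand-in for four of the diamonds columns (hypothetical values, not the real data)
toy <- data.frame(
  carat = c(0.23, 0.31, 0.70, 1.01, 1.50),
  x     = c(3.95, 4.34, 5.70, 6.40, 7.30),
  y     = c(3.98, 4.35, 5.72, 6.43, 7.26),
  z     = c(2.43, 2.75, 3.53, 3.97, 4.50)
)

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(toy), 3)
cor_matrix
```

Values close to 1 off the diagonal would support treating x, y and z as redundant once carat is in the model.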
1.3 So, we do find significant correlations. Let’s drop columns x, y and z from the dataset, in preparation to use only carat going forward.
diamonds_xyz <- diamonds %>%
  select(-x, -y, -z)
diamonds_xyz
1.4 We are interested in developing a regression model for the price of a diamond in terms of the possible predictor variables in the dataset. i. Use ggpairs() to investigate correlations between price and the predictors (this may take a while to run, don’t worry, make coffee or something).
ggpairs(diamonds_xyz)
1.4 ii. Perform further ggplot visualisations of any significant correlations you find.
diamonds_xyz %>%
  ggplot(aes(y = carat)) +
  geom_boxplot()
plot_diamonds <- diamonds_xyz %>%
  ggplot(aes(x = color, y = carat)) +
  geom_point()
plot_diamonds
plot_diamonds <- diamonds_xyz %>%
  ggplot(aes(x = price, y = carat)) +
  geom_point()
plot_diamonds
1.5 Shortly we may try a regression fit using one or more of the categorical predictors cut, clarity and color, so let’s investigate these predictors: i. Investigate the factor levels of these predictors. How many dummy variables do you expect for each of them?
library(fastDummies)
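One way to approach part i is to count the distinct values in each categorical column: a predictor with k levels needs k - 1 dummy variables once one level is taken as the reference. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical) rather than the real file; with the real data, the same `sapply()` call could be applied to the cut, color and clarity columns of `diamonds_xyz`.

```r
# Toy stand-in for the three categorical predictors (hypothetical values)
toy <- data.frame(
  cut     = c("Fair", "Good", "Ideal", "Good"),
  color   = c("D", "E", "D", "E"),
  clarity = c("SI1", "VS1", "VS2", "IF"),
  stringsAsFactors = FALSE
)

# Number of distinct levels per column
n_levels <- sapply(toy, function(col) length(unique(col)))

# One level acts as the reference, so each predictor needs (levels - 1) dummies
expected_dummies <- n_levels - 1
expected_dummies
```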
1.5 ii. Use the dummy_cols() function in the fastDummies package to generate dummies for these predictors and check the number of dummies in each case.
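# Dummy-encode the three categorical predictors (not the numeric carat column),
# dropping the first level of each as the reference category
diamonds_xyz_dummy <- diamonds_xyz %>%
  dummy_cols(
    select_columns = c("cut", "clarity", "color"),
    remove_first_dummy = TRUE,
    remove_selected_columns = TRUE
  )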
Registered S3 method overwritten by 'data.table':
method from
print.data.table
diamonds_xyz_dummy
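To check the number of dummies in each case, the generated columns can be counted by name prefix and compared with the levels-minus-one expectation. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical); it assumes fastDummies names each dummy column as the original column name followed by an underscore and the level.

```r
library(fastDummies)

# Toy data with two categorical predictors and one numeric column (hypothetical values)
toy <- data.frame(
  cut   = c("Fair", "Good", "Ideal", "Good"),
  color = c("D", "E", "D", "E"),
  price = c(326, 950, 2401, 5324),
  stringsAsFactors = FALSE
)

toy_dummy <- dummy_cols(
  toy,
  select_columns = c("cut", "color"),
  remove_first_dummy = TRUE,
  remove_selected_columns = TRUE
)

# Count the dummy columns created for each predictor by their name prefix
n_cut_dummies   <- sum(startsWith(names(toy_dummy), "cut_"))
n_color_dummies <- sum(startsWith(names(toy_dummy), "color_"))
c(cut = n_cut_dummies, color = n_color_dummies)
```

Applying the same prefix count to `diamonds_xyz_dummy` gives the number of dummies per predictor in the real data.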